Data Cleaning

To see all of the cleaned data, click here

Quarterly and Annual Ridership Totals by Mode of Transportation 1

The purpose of this data is to gain a baseline perspective on the current state of public transit usage in the United States. Therefore, this data set should be cleaned so that trends can be visualized, without superfluous information that does not relate to any current phenomena. The steps used in cleaning this data are below.

  • Trim the data set:
    • Keep columns 1 to 11, removing blank columns in the .csv file as well as notes added by the source.
    • Keep rows 81 to 133, removing records from before 2010, as those are superfluous when comparing to current trends.
  • Create one column to account for year and quarter to improve readability
  • Convert all numeric columns to a numeric data type
  • Remove extra year and quarter columns as they are now unnecessary
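The combine-and-convert steps can be sketched in pandas (the actual cleaning below is done in R; the two-row frame here is a hypothetical miniature of the APTA sheet):

```python
import pandas as pd

# Hypothetical miniature of the raw sheet: separate Year and Quarter
# columns plus comma-formatted ridership strings.
raw = pd.DataFrame({
    "Year": [2010, 2010],
    "Quarter": ["Q1", "Q2"],
    "total_ridership": ["2,510,339", "2,612,480"],
})

# Combine year and quarter into a single readable label.
raw["Year - Quarter"] = raw["Year"].astype(str) + " - " + raw["Quarter"]

# Strip thousands separators and convert to a numeric dtype.
raw["total_ridership"] = raw["total_ridership"].str.replace(",", "").astype(int)

# Drop the now-redundant Year and Quarter columns.
clean = raw.drop(columns=["Year", "Quarter"])
```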

Regarding the numeric fields, I have chosen to keep them all for now as each one can provide insight into which modes of transportation are most affected by certain factors. Below is the code to apply the steps laid out, as well as a comparison between the raw and cleaned data sets.

Code
import pandas as pd
from datetime import datetime
Code
library(tidyverse)
library(tidyr)

ridership <- read.csv("../data/APTA-Ridership-by-Mode-and-Quarter-1990-Present.csv")
ridership <- ridership[81:133,1:11]
colnames(ridership)[2] <- 'Year - Quarter'
colnames(ridership)[4:11] <- c("total_ridership", "heavy_rail", "light_rail", "commuter_rail", "trolleybus", "bus", "demand_response", "other")
num_cols <- c("total_ridership", "heavy_rail", "light_rail", "commuter_rail",
              "trolleybus", "bus", "demand_response", "other")
ridership[num_cols] <- lapply(ridership[num_cols], function(x) as.numeric(gsub(",", "", x)))
ggplot(data=ridership, aes(x=factor(`Year - Quarter`), y=total_ridership, group=1)) +
  geom_line()+
  geom_point()+
  labs(x = "Year - Quarter", y = "Total Ridership (000s)", title = "Total Public Transit Ridership in the U.S.")+
  theme(axis.text.x = element_text(angle = 45))
ridership <- ridership[c(2, 4:11)]
head(ridership)
write.csv(ridership, "../data/cleaned_data/ridership_by_quarter_cleaned.csv")

Raw Quarterly Ridership Data

Cleaned Quarterly Ridership Data

Census Data for Commute to Work 2

Note: Due to the size of this data, it will not be hosted on Github. The cleaned data can be accessed using this link.

The main objective of cleaning this data is to narrow down the fields by removing superfluous columns, and to decode the numeric codes the dataset uses in place of categorical values. To do this, we will reference the glossary that accompanies the dataset. The steps are the following:

  • Remove columns that provide excess detail
  • Rename columns
  • Remove columns that will not be necessary for any analysis techniques to be used later on
  • Replace codes for sex, marital_status, race, hispanic, employment, metropolitan_status, and transportation_type
    • Codes for metropolitan_status and transportation_type are aggregated to simplify data (e.g., all public transit types are labeled Public Transit)
  • Set age and income to numerical data types
  • Set all values where income is 0 and the person is not in the labor force to NA
  • Drop all rows where transportation_type is NA, as those are not labeled
  • Set all placeholder values for city_population to NA
  • Set all placeholder values for income to NA
  • Write to .csv
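The decode-and-aggregate step can be sketched with dict-based mapping in pandas (the actual cleaning below is done in R; the three-row frame is a hypothetical slice, and the code values come from the IPUMS glossary as listed above):

```python
import pandas as pd

# Hypothetical slice of the extract with numeric codes.
commute = pd.DataFrame({
    "sex": [1, 2, 1],
    "transportation_type": [10, 31, 0],   # 0 is the N/A code
})

# Decode sex codes per the glossary.
commute["sex"] = commute["sex"].map({1: "Male", 2: "Female"})

# Aggregate transportation codes into broader labels; codes with no
# mapping (here the 0 placeholder) become NA automatically.
transit_map = dict.fromkeys([10, 11, 12, 13, 14, 15, 20], "Private Vehicle")
transit_map.update(dict.fromkeys(range(31, 40), "Public Transit"))
transit_map.update({50: "Bicycle", 60: "Walk", 70: "Other", 80: "Work From Home"})
commute["transportation_type"] = commute["transportation_type"].map(transit_map)

# Drop rows whose transportation type is unlabeled.
commute = commute.dropna(subset=["transportation_type"])
```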

The code and output are shown below:

Code
library(tidyverse)
library(tidyr)

commute <- read.csv("../data/ipums_commute.csv")
commute <- commute[-c(6,8,12,15)]
colnames(commute) <- c('city_population_00s','sex','age','marital_status','race','hispanic','citizenship','english',
                       'employment','labor_force','worker_class', 'income','transportation_type','transportation_time')
commute <- commute[,!(names(commute) %in% c('citizenship','english','labor_force','worker_class'))]
commute$sex <- replace(commute$sex, commute$sex=='1', 'Male')
commute$sex <- replace(commute$sex, commute$sex=='2', 'Female')
commute$age <- as.integer(commute$age)
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='1', 'Married present')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='2', 'Married absent')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='3', 'Separated')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='4', 'Divorced')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='5', 'Widowed')
commute$marital_status <- replace(commute$marital_status, commute$marital_status=='6', 'Never married')
commute$race <- replace(commute$race, commute$race=='1', 'White')
commute$race <- replace(commute$race, commute$race=='2', 'Black')
commute$race <- replace(commute$race, commute$race=='3', 'American Indian')
commute$race <- replace(commute$race, commute$race=='4', 'Chinese')
commute$race <- replace(commute$race, commute$race=='5', 'Japanese')
commute$race <- replace(commute$race, commute$race=='6', 'Other Asian or PI')
commute$race <- replace(commute$race, commute$race=='7', 'Other race')
commute$race <- replace(commute$race, commute$race=='8', 'Two races')
commute$race <- replace(commute$race, commute$race=='9', 'Three or more races')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='0', 'Not Hispanic')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='1', 'Mexican')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='2', 'Puerto Rican')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='3', 'Cuban')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='4', 'Other')
commute$hispanic <- replace(commute$hispanic, commute$hispanic=='9', 'Not reported')
commute$employment <- replace(commute$employment, commute$employment=='0', NA)
commute$employment <- replace(commute$employment, commute$employment=='1', 'Employed')
commute$employment <- replace(commute$employment, commute$employment=='2', 'Unemployed')
commute$employment <- replace(commute$employment, commute$employment=='3', 'Not in labor force')
commute$income <- as.integer(commute$income)
commute$income <- replace(commute$income,commute$employment=='Not in labor force' & commute$income==0, NA)
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='0', NA)
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type %in% c('10','11','12','13','14','15','20'), 'Private Vehicle')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type %in% c('31','32','33','34','35','36','37','38','39'), 'Public Transit')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='50', 'Bicycle')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='60', 'Walk')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='70', 'Other')
commute$transportation_type <- replace(commute$transportation_type, commute$transportation_type=='80', 'Work From Home')
commute$city_population_00s <- replace(commute$city_population_00s, commute$city_population_00s %in% c(0,99999), NA)
commute$income <- replace(commute$income, commute$income %in% c(-009995,-000001,0000000,0000001,9999999), NA)

commute <- commute %>% drop_na(transportation_type)
head(commute)
write.csv(commute, "../data/cleaned_data/commute_cleaned.csv")

Commute by Demographic - Cleaned

Public Transit Data by City 3

This data was gathered to attempt to find differences in public transit system performance by city. However, the raw data comes in a slightly different form: the observational unit is mode of transportation, separated by transit agency. For our purposes, we want the observational unit to be each city, so many of these rows must be consolidated. To accomplish this, it is important to understand which values should be summed (i.e., counting variables) and which should be combined as trip-weighted averages. For each column, a formula must be applied to ensure proper consolidation. The steps for cleaning this dataset are as follows:

  • Remove rows with unnecessary information
  • Remove rows in which the Most Recent Report Year is not 2022, the latest year with sufficient data
  • Initialize new dataframe to insert consolidated rows
    • Length equal to the number of unique city names (i.e., number of cities)
  • Set Population to the population value associated with each city
  • Set Area to the area value in square miles associated with each city
  • Set Cost_per_trip, Fare_per_trip, and Miles_per_trip
    • As these are all average values, each agency's value is multiplied by its number of passenger trips to weight it, the weighted values are summed, and the sum is divided by the city's total passenger trips
  • Compute Total_trips as the sum of passenger trips for each city, then Trips_per_capita as Total_trips divided by Population
  • Write the resulting dataframe to a .csv file
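The trip-weighted consolidation can be sketched in pandas (the actual cleaning below is done in R; the cities "A" and "B", their agencies, and the population figures here are all hypothetical):

```python
import pandas as pd

# Hypothetical per-agency rows for two cities: an average cost per
# trip and the number of unlinked passenger trips, mirroring the
# layout of the APTA spreadsheet.
agencies = pd.DataFrame({
    "city": ["A", "A", "B"],
    "cost_per_trip": [2.0, 4.0, 3.0],
    "trips": [900_000, 100_000, 500_000],
})

# Weighted average: multiply each value by its trips, sum per city,
# then divide by the city's total trips.
agencies["weighted"] = agencies["cost_per_trip"] * agencies["trips"]
per_city = agencies.groupby("city")[["weighted", "trips"]].sum()
per_city["cost_per_trip"] = per_city["weighted"] / per_city["trips"]

# Trips per capita: total trips divided by population.
population = {"A": 2_000_000, "B": 750_000}
per_city["population"] = per_city.index.map(population)
per_city["trips_per_capita"] = per_city["trips"] / per_city["population"]
```

City A's weighted cost is (2.0 × 900,000 + 4.0 × 100,000) / 1,000,000 = 2.2, which reflects the dominant agency far more than a plain average of 2.0 and 4.0 would.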

The code and output from cleaning this dataset are below:

Code
library(tidyverse)
library(tidyr)
library(readxl)

cities <- read_excel("../data/apta-cities_9-23.xlsx",sheet = 2)
cities <- cities[-c(2,4:6,8:12,16:17,19:20)]
cities <- cities[(cities$Status %in% "Active" & cities$`Most Recent Report Year` %in% 2022),]
unique(cities$Status)
head(cities)
nrows <- length(unique(cities$`UZA Name`))
cities2 <- data.frame(City=character(nrows), Population=numeric(nrows), Area=numeric(nrows),
                      Cost_per_trip=numeric(nrows), Fare_per_trip=numeric(nrows), Miles_per_trip=numeric(nrows))
cities2$City <- unique(cities$`UZA Name`)
for (i in 1:nrows) {
  cities2[i,2] <- as.numeric(unique(cities$`UZA Population`[cities$`UZA Name`==cities2[i,1]])[1])
}
for (i in 1:nrows) {
  cities2[i,3] <- round(as.numeric(unique(cities$`UZA SQ Miles`[cities$`UZA Name`==cities2[i,1]])[1]), digits = 2)
}
for (i in 1:nrows) {
  cities2[i,4] <- sum((cities$`Avg Cost Per Trip FY`[cities$`UZA Name`==cities2[i,1]] * cities$`Unlinked Passenger Trips FY`[cities$`UZA Name`==cities2[i,1]])
                      / sum(cities$`Unlinked Passenger Trips FY`[cities$`UZA Name`==cities2[i,1]]))
}
for (i in 1:nrows) {
  cities2[i,5] <- sum((cities$`Avg Fares Per Trip FY`[cities$`UZA Name`==cities2[i,1]] * cities$`Unlinked Passenger Trips FY`[cities$`UZA Name`==cities2[i,1]])
                      / sum(cities$`Unlinked Passenger Trips FY`[cities$`UZA Name`==cities2[i,1]]))
}
for (i in 1:nrows) {
  cities2[i,6] <- sum((cities$`Avg Trip Length FY`[cities$`UZA Name`==cities2[i,1]] * cities$`Unlinked Passenger Trips FY`[cities$`UZA Name`==cities2[i,1]])
                      / sum(cities$`Unlinked Passenger Trips FY`[cities$`UZA Name`==cities2[i,1]]))
}
# Total_trips: sum of unlinked passenger trips across all agencies in each city
cities2$Total_trips <- sapply(cities2$City, function(city)
  sum(cities$`Unlinked Passenger Trips FY`[cities$`UZA Name` == city]))
cities2$Trips_per_capita <- cities2$Total_trips / cities2$Population
head(cities2)
write.csv(cities2, "../data/cleaned_data/apta_cities_cleaned.csv")

Cleaned Cities Data

Yelp Reviews 4 5 6 7 8 9 10

The purpose of cleaning this data is to enable Naive Bayes classification later on, as we have labeled text data that can be valuable for analyzing how people express opinions about public transit systems. The raw data contains duplicates on each page that must be dealt with, and some data types need correcting. Since this process must be repeated seven times, once per transit agency, we will create a generalized function to be called for each city involved. The steps in this function are:

  • Remove excess columns
  • Remove rows where review is duplicated (date is NA in these records, so we drop based on that)
  • Add a column to keep track of which agency the review is about
  • Change column names
  • Take just the numerical rating and set to integer type
  • Set Date field to date type
  • Append to the main dataframe using the pd.concat() function
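The date-parsing step can be sketched on its own; Yelp dates arrive as strings like "Nov 14, 2023", and the Series below is illustrative:

```python
import pandas as pd
from datetime import datetime

# A single date string parsed with the stated format.
d = datetime.strptime("Nov 14, 2023", "%b %d, %Y")

# Applied to a whole column, pandas parses the same format vectorized.
dates = pd.to_datetime(pd.Series(["Nov 14, 2023", "Nov 02, 2023"]),
                       format="%b %d, %Y")
```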

The code for this function is below:

Code
def clean_yelp(x, y):
    # Read the scraped reviews for agency x
    df = pd.read_csv('../data/yelp_reviews/' + x + '_reviews.csv')
    df = df.drop(columns='Unnamed: 0')
    # Duplicated reviews have NA in the date column, so drop on that basis
    df = df[df['1'].notna()]
    # Track which agency the review is about
    df['Agency'] = x
    df = df.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
    # Keep just the leading digit of the rating text
    df['Rating'] = df['Rating'].str[0].astype(int)
    # Parse each date string into a datetime
    df['Date'] = df['Date'].apply(lambda d: datetime.strptime(d, "%b %d, %Y"))
    # Append to the running dataframe
    return pd.concat([y, df])

Next, we will call the function for each of our seven cities. The code and output for this are below:

Code
yelp_cleaned = pd.DataFrame(columns=['Rating', 'Date', 'Review', 'Agency'])
yelp_cleaned = clean_yelp('mta', yelp_cleaned)
yelp_cleaned = clean_yelp('la', yelp_cleaned)
yelp_cleaned = clean_yelp('cta', yelp_cleaned)
yelp_cleaned = clean_yelp('septa', yelp_cleaned)
yelp_cleaned = clean_yelp('mbta', yelp_cleaned)
yelp_cleaned = clean_yelp('bart', yelp_cleaned)
yelp_cleaned = clean_yelp('wmata', yelp_cleaned)
yelp_cleaned.to_csv('../data/cleaned_data/yelp_cleaned.csv')

Cleaned Yelp Reviews

Footnotes

  1. “Ridership Report.” American Public Transportation Association, 21 Sept. 2023, www.apta.com/research-technical-resources/transit-statistics/ridership-report/.↩︎

  2. Steven Ruggles, Sarah Flood, Matthew Sobek, Danika Brockman, Grace Cooper, Stephanie Richards, and Megan Schouweiler. IPUMS USA: Version 13.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V13.0↩︎

  3. “Raw monthly ridership (no adjustments or estimates),” Raw Monthly Ridership (No Adjustments or Estimates) | FTA, https://www.transit.dot.gov/ntd/data-product/monthly-module-raw-data-release (accessed Nov. 14, 2023).↩︎

  4. “Metropolitan Transportation Authority - New York, NY,” Yelp, https://www.yelp.com/biz/metropolitan-transportation-authority-new-york-6 (accessed Nov. 14, 2023).↩︎

  5. “Metro Los Angeles - Los Angeles, CA,” Yelp, https://www.yelp.com/biz/wmata-washington (accessed Nov. 14, 2023).↩︎

  6. “Chicago Transit Authority - Chicago, IL,” Yelp, https://www.yelp.com/biz/metro-los-angeles-los-angeles (accessed Nov. 14, 2023).↩︎

  7. “Septa - Philadelphia, PA,” Yelp, https://www.yelp.com/biz/septa-philadelphia-7 (accessed Nov. 14, 2023).↩︎

  8. “Massachusetts Bay Transportation Authority - Boston, MA,” Yelp, https://www.yelp.com/biz/massachusetts-bay-transportation-authority-boston (accessed Nov. 14, 2023).↩︎

  9. “WMATA - Washington, DC, DC,” Yelp, https://www.yelp.com/biz/wmata-washington (accessed Nov. 2, 2023).↩︎

  10. “Bart - Bay Area Rapid Transit - Oakland, CA,” Yelp, https://www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2 (accessed Nov. 2, 2023).↩︎

  11. Barrero, Jose Maria, et al. Why Working from Home Will Stick, 2021, https://doi.org/10.3386/w28731.↩︎